On b-bit min-wise hashing for large-scale regression and classification with sparse data
Abstract
Large-scale regression problems where both the number of variables, p, and the number of observations, n, may be large and in the order of millions or more, are becoming increasingly common. Typically the data are sparse: only a fraction of a percent of the entries in the design matrix are non-zero. Nevertheless, often the only computationally feasible approach is to perform dimension reduction to obtain a new design matrix with far fewer columns, and then work with this compressed data. b-bit min-wise hashing [Li and König, 2011, Li et al., 2011] is a promising dimension reduction scheme for sparse matrices. In this work we study the prediction error of procedures which perform regression in the new lower-dimensional space after applying the method. For both linear and logistic models we show that the average prediction error vanishes asymptotically as long as q‖β∗‖₂²/n → 0, where q is the average number of non-zero entries in each row of the design matrix and β∗ is the coefficient vector of the linear predictor. We also show that ordinary least squares or ridge regression applied to the reduced data amounts, in a sense, to a form of non-parametric regression and can in fact allow us to fit more flexible models. We obtain non-asymptotic prediction error bounds for interaction models and for models where an unknown row normalisation must be applied before the signal is linear in the predictors.
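To make the pipeline concrete, here is a minimal sketch, not taken from the paper, of applying b-bit min-wise hashing to a sparse binary design matrix and then fitting ridge regression on the reduced data. The helper b_bit_minwise_hash, the use of explicit random permutations as hash functions, and all parameter values and toy data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import Ridge

def b_bit_minwise_hash(X, num_hashes=50, b=2, seed=0):
    """Map a sparse binary design matrix X (n x p) to an n x (num_hashes * 2**b)
    matrix of one-hot encoded b-bit min-wise hashes.

    Hypothetical helper for illustration only: each hash function is simulated
    by an explicit random permutation of the column indices.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    X = sparse.csr_matrix(X)
    Z = np.zeros((n, num_hashes * 2 ** b))
    for k in range(num_hashes):
        perm = rng.permutation(p)                      # one "hash function" per pass
        for i in range(n):
            cols = X.indices[X.indptr[i]:X.indptr[i + 1]]
            if cols.size == 0:
                continue                               # empty row: leave all zeros
            min_val = perm[cols].min()                 # min-wise hash of row i
            low_bits = min_val % (2 ** b)              # keep only the lowest b bits
            Z[i, k * 2 ** b + low_bits] = 1.0          # one-hot encode the b-bit value
    return Z

# Toy usage: sparse binary X with roughly q non-zeros per row, a linear signal,
# and ridge regression fitted on the reduced design matrix.
n, p, q = 200, 1000, 20
rng = np.random.default_rng(1)
rows = np.repeat(np.arange(n), q)
cols = rng.integers(0, p, size=n * q)
X = sparse.csr_matrix((np.ones(n * q), (rows, cols)), shape=(n, p))
X.data[:] = 1.0                                        # keep entries binary even if duplicates were summed
beta = np.zeros(p)
beta[:10] = 1.0
y = X @ beta + 0.1 * rng.standard_normal(n)

Z = b_bit_minwise_hash(X, num_hashes=50, b=2)
print("reduced dimension:", Z.shape[1])
model = Ridge(alpha=1.0).fit(Z, y)
print("training MSE on reduced data:", np.mean((model.predict(Z) - y) ** 2))
```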
Similar resources
Minwise hashing for large-scale regression and classification with sparse data
We study large-scale regression analysis where both the number of variables, p, and the number of observations, n, may be large and in the order of millions or more. This is very different from the now well-studied high-dimensional regression context of “large p, small n”. For example, in our “large p, large n” setting, an ordinary least squares estimator may be inappropriate for computational,...
b-Bit Minwise Hashing for Large-Scale Learning
Minwise hashing is a standard technique in the context of search for efficiently computing set similarities. The recent development of b-bit minwise hashing provides a substantial improvement by storing only the lowest b bits of each hashed value. In this paper, we demonstrate that b-bit minwise hashing can be naturally integrated with linear learning algorithms such as linear SVM and ...
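The integration with a linear learner sketched above relies on expanding each b-bit hash value into an indicator vector of length 2^b, so that the model stays linear in the expanded features. The toy sketch below, with made-up hash values and labels standing in for real data, shows that expansion feeding scikit-learn's LinearSVC; it is an assumption-laden illustration, not the authors' code.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Suppose each of n samples has already been summarised by k b-bit min-wise
# hash values in {0, ..., 2**b - 1}; here they are random stand-ins.
b, k, n = 2, 30, 200
rng = np.random.default_rng(0)
hash_values = rng.integers(0, 2 ** b, size=(n, k))    # placeholder b-bit hashes
y = rng.integers(0, 2, size=n)                        # placeholder binary labels

# Expand every b-bit value into a 2**b-dimensional indicator vector.
Z = np.zeros((n, k * 2 ** b))
rows = np.repeat(np.arange(n), k)
cols = (np.arange(k) * 2 ** b + hash_values).ravel()  # per-hash column offset + value
Z[rows, cols] = 1.0

clf = LinearSVC(max_iter=5000).fit(Z, y)              # any linear learner would do here
print("training accuracy:", clf.score(Z, y))
```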
b-Bit Minwise Hashing for Large-Scale Linear SVM
Linear Support Vector Machines (e.g., SVM, Pegasos, LIBLINEAR) are powerful and extremely efficient classification tools when the datasets are very large and/or high-dimensional, which is common in, e.g., text classification. Minwise hashing is a popular technique in the context of search for computing resemblance similarity between ultra high-dimensional (e.g., 2^64) data vectors such as document...
Training Logistic Regression and SVM on 200GB Data Using b-Bit Minwise Hashing and Comparisons with Vowpal Wabbit (VW)
Our recent work on large-scale learning using b-bit minwise hashing [21, 22] was tested on the webspam dataset (about 24 GB in LibSVM format), which may be way too small compared to real datasets used in industry. Since we could not access the proprietary dataset used in [31] for testing the Vowpal Wabbit (VW) hashing algorithm, in this paper we present an experimental study based on the expand...
Compressed Image Hashing using Minimum Magnitude CSLBP
Image hashing allows compression, enhancement or other signal processing operations on digital images, which are usually acceptable manipulations. Cryptographic hash functions, in contrast, are very sensitive to even single-bit changes in an image. An image hash is a sum of important quality features in quantized form. In this paper, we propose a novel image hashing algorithm for authentication which i...
Publication date: 2016